In many government applications, information about entities, such as persons, is available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises that have multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide a driving-licence number while applying for a passport, or vice versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph traversal, locality-sensitive hashing, iterative match-merge, and graph clustering to discover unique entities from a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally while re-resolving only a small subset of a previously resolved entity-document collection. We present performance and quality results on two datasets: a real-world database of companies and a large, synthetically generated 'population' database. We also demonstrate the benefit of using inter-document references for clustering, in the form of enhanced recall of documents for resolution.
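To make the core idea concrete, the following is a minimal sketch, not the algorithm of the paper, of how inter-document references and attribute similarity could jointly drive clustering. Everything named here is an assumption made for illustration: the `docs` layout with `tokens` and `refs` fields, the `cluster_documents` function, the Jaccard measure, and the 0.5 threshold. The sketch compares all pairs of documents, whereas the technique described above uses locality-sensitive hashing to avoid that, and it omits the iterative match-merge and incremental steps entirely.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.5):
    """Group documents into candidate entities.

    `docs` maps a document id to a dict with two hypothetical fields:
      "tokens": attribute tokens (name parts, birth year, city, ...)
      "refs":   ids of other documents that this document cites
    """
    # Union-find structure: every document starts in its own cluster.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[max(root_x, root_y)] = min(root_x, root_y)

    # Step 1: an explicit reference links two documents of the same entity.
    for doc_id, doc in docs.items():
        for ref in doc.get("refs", ()):
            if ref in docs:
                union(doc_id, ref)

    # Step 2: attribute similarity. This is an all-pairs comparison for
    # simplicity; a blocking scheme such as LSH would replace this loop.
    for a, b in combinations(docs, 2):
        if jaccard(docs[a]["tokens"], docs[b]["tokens"]) >= threshold:
            union(a, b)

    # Collect the connected components as clusters.
    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(clusters.values(), key=sorted)
```

In this sketch, a passport record that cites a driving-licence number is merged with that licence even when their attributes overlap too little to match on similarity alone, which is the recall benefit the abstract refers to.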